Introduction to scRNAseq analysis using scanpy

Workshop

Author
Affiliation
Irene Robles

CoSyne Therapeutics

Published

March 1, 2024

Goal

Introduction to scRNAseq analysis using scanpy

Who am I?

  • Irene Robles Rebollo
  • BSc in Biochemistry
  • MSc in Applied Biosciences and Biotechnology
  • PhD in Cell Biology
    • Half experimental
    • Half computational
  • Bioinformatician at Cosyne Therapeutics

History

Year Achievement
1953 DNA structure
1985 First PCR - DNA amplification
1988 PCR with Taq polymerase - Automated DNA amplification
1993 First qPCRs - RNA amplification
1995 Microarrays - Quantify an array gene expression using a chip
2001 First draft of the human genome
2003 First NGS DNA sequencer - High throughput DNA sequencing
2006 First RNA-seq - High throughput RNA sequencing
2009 First single-cell RNA-seq (Tang et al. 2009)
2013 Single-cell RNA-seq is declared method of the year

(Zhu et al. 2020; Gondane and Itkonen 2023)

Single-cell vs bulk RNAseq

Bulk RNA-seq, LyfeFuel from Unsplash

scRNA-seq, VD Photography from Unsplash

Feature Bulk data Single-cell data
Cell resolution Average of all cells Individual cell resolution
Sample representation Vector of gene expression values Matrix of gene expression values
Genomic resolution Higher, depends on sequencing depth Lower, depends on starting material
Cost Lower High
Computational requirements Lower Higher
Data size Lower Higher
Data interpretation Simple Complex

Sample representation

  • In bulk data, each sample is repressented by a vector, where each value is a gene.
  • In single cell data, each sample is a matrix, where each row is a gene and each column is a cell.

\[\begin{align} Bulk &= \begin{bmatrix} gene_{1} \\ gene_{2} \\ gene_{3}\\ \vdots \\ gene_{n} \end{bmatrix} \\ \\ Single-cell &= \begin{bmatrix} gene_1, cell_1 & gene_1, cell_2 & gene_1, cell_3 & \dots & gene_1, cell_m \\ gene_2, cell_1 & gene_2, cell_2 & gene_2, cell_3 & \dots & gene_2, cell_m \\ gene_3, cell_1 & gene_3, cell_2 & gene_3, cell_3 & \dots & gene_3, cell_m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ gene_n, cell_1 & gene_n, cell_2 & gene_n, cell_3 & \dots & gene_n, cell_m \end{bmatrix} \end{align}\]

Scanpy vs Seurat

  • Scanpy is a Python package for single-cell analysis (Wolf, Angerer, and Theis 2018)
  • Seurat is an R package for single-cell analysis (Satija et al. 2015)
  • Both are:
    • User-friendly tools for single-cell analysis
    • Open source
    • Well-documented
    • Widely-used
  • Choice depends on:
    • Language preference
    • Team expertise
    • Integration with downstream analysis
    • Speed and memory requirements

Hint: A good bioinformatician is not restricted by language. You can use R in Python can be done using the rpy2 package. And Python can be use within R using reticulate.

Scale of scRNAseq data

Number of cells per study over years (Svensson, Veiga Beltrame, and Pachter 2020)

AnnData object

AnnData object, source: scanpy web

Set up

  • Install Miniconda
  • Create a new environment
conda create -n myscanpy python=3.10
conda activate myscanpy
pip install -r requirements.txt
python -m ipykernel install --user --name myscanpy --display-name "Python (myscanpy)"

Import libraries

import scanpy as sc
import scrublet as scr
import numpy as np
import os
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns

Scanpy setttings

sc.settings.verbosity = 3   # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')
scanpy==1.9.8 anndata==0.10.5.post1 umap==0.5.5 numpy==1.26.4 scipy==1.12.0 pandas==2.2.1 scikit-learn==1.4.1.post1 statsmodels==0.14.1 igraph==0.11.4 pynndescent==0.5.11

Download data

mkdir data
cd data
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE190nnn/GSE190622/suppl/GSE190622%5Fcount%5Fmatrix%5FAnnotated.csv.gz

Load tests datasets

Single-cell SMART-seq data from mouse macrophages (Robles-Rebollo et al. 2022)

mf = pd.read_csv("data/GSE190622_count_matrix_Annotated.csv.gz", index_col=0)
adata = sc.AnnData(X=mf.T)
adata.obs["genotype"] = adata.obs_names.to_series().apply(lambda x: x.split("_")[0])
adata.obs["timepoint"] = adata.obs_names.to_series().apply(lambda x: x.split("_")[1])
adata.obs["sample"] = adata.obs["genotype"] + "_" + adata.obs["timepoint"]
adata.var_names_make_unique()
adata
AnnData object with n_obs × n_vars = 1362 × 18429
    obs: 'genotype', 'timepoint', 'sample'

Doublets

  • A doublet is a single-cell transcriptome that has been generated by two cells
  • They have often have a higher read and gene count
  • They can produce confounding results

Doublets, adapted from (Wolock, Lopez, and Klein 2019)

How do we fight doublets: - Experiment design - Computationally

Scrublet

Scrublet finds doublets based os simulations (Wolock, Lopez, and Klein 2019)

Scrublet algorithm overview (Wolock, Lopez, and Klein 2019)

scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
scrub.plot_histogram()
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.41
Detected doublet rate = 0.1%
Estimated detectable doublet fraction = 25.8%
Overall doublet rate:
    Expected   = 6.0%
    Estimated  = 0.3%
Elapsed time: 21.9 seconds
(<Figure size 640x240 with 2 Axes>,
 array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
        <Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
       dtype=object))

Adjust threshold

predicted_doublets = scrub.call_doublets(threshold=0.2)
scrub.plot_histogram()
Detected doublet rate = 0.7%
Estimated detectable doublet fraction = 50.8%
Overall doublet rate:
    Expected   = 6.0%
    Estimated  = 1.4%
(<Figure size 640x240 with 2 Axes>,
 array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
        <Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
       dtype=object))

Visualise doublets

scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
scrub.plot_embedding('UMAP', order_points=True);

Filter out doublets

adata.obs["scrublet_doublet"] = predicted_doublets
adata = adata[adata.obs["scrublet_doublet"].apply(lambda x: x is False)]

Preprocessing

Highest expressing genes

Look for suspects: MALAT1, mitochondrial genes, ribosomal genes, componenets of the cytoskeleton, etc.

sc.pl.highest_expr_genes(adata, n_top=20)
normalizing counts per cell
    finished (0:00:00)

Quality metrics

sc.pp.calculate_qc_metrics computes quality control metrics for each cell. - Number of counts per cell - Number of genes per cell - Percentage of counts that come from mitochondrial genes.

# Mitochondrial genes
adata.var['mt'] = adata.var_names.str.startswith('mt-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True, rotation = 90)

Different sampes might require different thresholds

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True, groupby='sample', rotation = 90)

sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', color='sample')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='sample')

In this case, cells from different samples are reasonably homogeneous. However, that is not always the cells_per_paper.ipynb

Example case where different tissues might require different quality control threshols, fraction of Tabula Muris data (Schaum et al. 2018)

Filtering

Filter cells based on quality metrics

sc.pp.filter_cells(adata, min_genes=500)
sc.pp.filter_genes(adata, min_cells=5)
adata= adata[adata.obs.n_genes_by_counts < 5000, :]
adata= adata[adata.obs.pct_counts_mt < 15, :]
filtered out 1 genes that are detected in less than 5 cells

Normalisation: size normalisation and log transformation

For simplicity, we will use the simplest method: library size normalisation and log transformation, but there are others Hafemeister and Satija (2019).

Steps:

  • Transcript length normalisation:
    • Adjusts for differences in transcript length between genes
    • Divides the counts in each cell by the length of the transcript
    • Not necessary if you sequence a fixed region of the transcript ( 3’ or 5’ in 10X, our case)
  • Library size normalisation:
    • Adjusts for differences in sequencing depth between cells
    • Divides the counts in each cell by the total counts in that cell and multiplies by a scale factor (e.g. 10,000)
sc.pp.normalize_total(adata, target_sum=1e4)
normalizing counts per cell
    finished (0:00:00)
  • Log transformation:
    • Logarithm of the normalised counts
    • Makes the data more normally distributed
sc.pp.log1p(adata)

Highly variable genes

  • Genes that show the most variation across cells
  • Often the most informative genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)

Dimensionality reduction

Goal

  • Reducing the number of input variables or features in a dataset.

Why

  • Noise Reduction: Focus on the most variance-contributing features, thus enhancing the signal-to-noise ratio.
  • Data Visualization: We can plot in 2 or 3 dimensions
  • Aid clustering and clasification
  • Computational efficiency

Dimensionality reduction techniques

  • Principal Component Analysis (PCA): PCA is often the first step in dimensionality reduction for single-cell data. It identifies the principal components that capture the most variance in the data, reducing its dimensionality while preserving its structure as much as possible.
  • t-Distributed Stochastic Neighbor Embedding (t-SNE): Popular dimensionality reduction that foccusses on the preservation on local structures
  • Uniform Manifold Approximation and Projection (UMAP): Popular dimensionality reduction that foccusses on the preservation on global structures
  • Autoencoders: Deep learning-based autoencoders can learn to compress single-cell data into a lower-dimensional representation in an unsupervised manner, capturing nonlinear relationships in the data.

Prepare data for dimensionality reduction

The .raw attribute freezes the state of the AnnData object for later use. We can recumerate it by calling .raw.to_adata().

adata.raw = adata

Filter by highly variable genes

adata= adata[:, adata.var.highly_variable]
adata
View of AnnData object with n_obs × n_vars = 1277 × 3193
    obs: 'genotype', 'timepoint', 'sample', 'scrublet_doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'sample_colors', 'log1p', 'hvg'

Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed.

sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
regressing out ['total_counts', 'pct_counts_mt']
    finished (0:00:02)

Scale each gene to unit variance. Clip values exceeding standard deviation to 10.

sc.pp.scale(adata, max_value=10)

Principal component analysis (PCA)

PCA is often the first step in dimensionality reduction for single-cell data. It identifies the principal components that capture the most variance in the data, reducing its dimensionality while preserving its structure as much as possible.

sc.tl.pca(adata, svd_solver='arpack')
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:36)
sc.pl.pca(adata, color='sample')

Varinace explained by each component:

sc.pl.pca_variance_ratio(adata, log=True)

UMAP and t-SNE

First, we need to compute the neighbourhood graph:

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)

Compute UMAP:

sc.tl.umap(adata)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:01)

Compute t-SNE:

sc.tl.tsne(adata)
computing tSNE
    using 'X_pca' with n_pcs = 50
    using sklearn.manifold.TSNE
    finished: added
    'X_tsne', tSNE coordinates (adata.obsm) (0:00:03)

Plot UMAP:

sc.pl.umap(adata, color='sample', use_raw=False)

Plot t-SNE:

sc.pl.tsne(adata, color='sample', use_raw=False)

Clustering the neighborhood graph

As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

sc.tl.leiden(adata)
sc.pl.umap(adata, color='leiden', use_raw=False)
running Leiden clustering
    finished: found 8 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

Finding marker genes

Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.

sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:01)

pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
0 1 2 3 4 5 6 7
0 Ccl5 Ctsd Mapkapk2 Lgals3 Sepp1 Tnfsf9 Tubb5 Gm12155
1 Il1rn Laptm5 Ebi3 Ccnd1 Tnfsf9 Cxcl2 Tuba1b Gm42836
2 Isg15 Lamp1 Il1b Ifi27l2a Nfkbiz Tnf Top2a Gm11942
3 Saa3 Gm42418 Serp1 Adam8 Cxcl1 Ccl4 Hist1h2ap Gm13341
4 Nos2 Gas6 Ehd1 Lpl Serinc3 Id2 Hmgb2 Rplp1

During the course of this analysis, the AnnData accumlated the following annotations.

adata
AnnData object with n_obs × n_vars = 1277 × 3193
    obs: 'genotype', 'timepoint', 'sample', 'scrublet_doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes', 'leiden'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'sample_colors', 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'tsne', 'leiden', 'leiden_colors', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap', 'X_tsne'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Inspirations

References

Gondane, Aishwarya, and Harri M Itkonen. 2023. “Revealing the History and Mystery of RNA-Seq.” Current Issues in Molecular Biology 45 (3): 1860–74.
Hafemeister, Christoph, and Rahul Satija. 2019. “Normalization and Variance Stabilization of Single-Cell RNA-Seq Data Using Regularized Negative Binomial Regression.” Genome Biology 20 (1): 296.
Risso, Davide, Fanny Perraudeau, Svetlana Gribkova, Sandrine Dudoit, and Jean-Philippe Vert. 2018. “A General and Flexible Method for Signal Extraction from Single-Cell RNA-Seq Data.” Nature Communications 9 (1): 284.
Robles-Rebollo, Irene, Sergi Cuartero, Adria Canellas-Socias, Sarah Wells, Mohammad M Karimi, Elisabetta Mereu, Alexandra G Chivu, et al. 2022. “Cohesin Couples Transcriptional Bursting Probabilities of Inducible Enhancers and Promoters.” Nature Communications 13 (1): 4342.
Satija, Rahul, Jeffrey A Farrell, David Gennert, Alexander F Schier, and Aviv Regev. 2015. “Spatial Reconstruction of Single-Cell Gene Expression Data.” Nature Biotechnology 33 (5): 495–502.
Schaum, Nicholas, Jim Karkanias, Norma F Neff, Andrew P May, Stephen R Quake, Tony Wyss-Coray, Spyros Darmanis, et al. 2018. “Single-Cell Transcriptomics of 20 Mouse Organs Creates a Tabula Muris: The Tabula Muris Consortium.” Nature 562 (7727): 367.
Svensson, Valentine, Eduardo da Veiga Beltrame, and Lior Pachter. 2020. “A Curated Database Reveals Trends in Single-Cell Transcriptomics.” Database 2020: baaa073.
Tang, Fuchou, Catalin Barbacioru, Yangzhou Wang, Ellen Nordman, Clarence Lee, Nanlan Xu, Xiaohui Wang, et al. 2009. “mRNA-Seq Whole-Transcriptome Analysis of a Single Cell.” Nature Methods 6 (5): 377–82.
Wolf, F Alexander, Philipp Angerer, and Fabian J Theis. 2018. “SCANPY: Large-Scale Single-Cell Gene Expression Data Analysis.” Genome Biology 19: 1–5.
Wolock, Samuel L, Romain Lopez, and Allon M Klein. 2019. “Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.” Cell Systems 8 (4): 281–91.
Zhu, Hanliang, Haoqing Zhang, Ying Xu, Soňa Laššáková, Marie Korabečná, and Pavel Neužil. 2020. “PCR Past, Present and Future.” Biotechniques 69 (4): 317–25.